AN INTERVAL ESTIMATE FOR STATISTICAL INFERENCE ABOUT TRUE SCORES*
Frederic M. Lord and Martha S. Hamilton

Abstract 

A numerical procedure is outlined for obtaining an interval estimate 
of true score. The procedure is applied to several sets of test data.

We wish to infer the true score of an individual examinee in a group
of examinees from his observed score. The distribution of observed scores
for a given true score is assumed to be binomial. If the distribution of
true scores were known, the usual (Bayes) estimator of true score from
observed score would be given by the regression of true score on observed
score. If the distribution of true scores is unknown, which is always the
case with real data, this regression is not uniquely determined by the
observed-score distribution, even in an infinitely large population of
examinees (Lord & Novick, 1968, section 23.5).

In practice, the regression function of true score on observed score
is frequently assumed to be linear. This assumption can be correct only if
the unconditional observed-score distribution is negative hypergeometric. For
any set of real data, then, the question arises: what limits or bounds can be
placed on this regression under the binomial error model without making
linearity assumptions? This paper presents a technique for computing an
interval estimate of the regression function of true score on observed score
under the binomial error model. The procedure is not simple. Our main interest
here is to demonstrate the range of reasonable estimates of true scores that
can be obtained from a set of data.

The same technique is applicable to problems outside of mental test 
theory whenever there is a set of true values and a set of binomial 
errors of measurement. This more general empirical Bayes problem, not 
related to mental test theory, is discussed separately (Lord, 1971). 

*This research was sponsored in part by the Personnel and Training 
Research Programs, Psychological Sciences Division, Office of Naval 
Research, under Contract No. N00014-69-C-0017, Contract Authority 
Identification Number, NR No. 150-303, and Educational Testing Service. 
Reproduction in whole or in part is permitted for any purpose of the
United States Government.

The Model



The observed score $x$ is assumed to be an integer $0, 1, 2, \ldots, n$,
where $n$ is the number of items in the test. For each $x$ there is an
unobservable true score $\zeta$, $0 \le \zeta \le 1$. The difference between
$x$ and $n\zeta$ represents error of measurement. For a given $\zeta$, $x$ has
the binomial distribution

$$h(x \mid \zeta) = \binom{n}{x}\, \zeta^{x} (1 - \zeta)^{n-x}, \qquad x = 0, 1, \ldots, n. \tag{1}$$

A sample of $N$ observations on $x$ is drawn at random from some population
of pairs $(x, \zeta)$. We observe $x$, but not the corresponding $\zeta$. We
wish to estimate the true score $\zeta$ corresponding to a particular observed
score $x$.

Let $G(\zeta)$ be the unknown cumulative distribution function of true
scores for the population from which the $N$ sample observations were drawn.
The relative frequency distribution of observed scores for the population
may be written

$$\phi_G(x) = \int_0^1 h(x \mid \zeta)\, dG(\zeta), \qquad x = 0, 1, \ldots, n. \tag{2}$$

If $G(\zeta)$ were known, the usual Bayes estimate of the true score for a
particular observed score would be the regression of true score on observed
score,

$$\mu_{\zeta|x} = \frac{1}{\phi_G(x)} \int_0^1 \zeta\, h(x \mid \zeta)\, dG(\zeta), \qquad x = 0, 1, \ldots, n. \tag{3}$$
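Equations (2) and (3) can be checked numerically for any assumed $G$. The sketch below is only an illustration: it assumes a Beta(a, b) true-score distribution (not something used in the paper's computations), chosen because it leads to a negative hypergeometric observed-score distribution and to the linear regression $(a + x)/(a + b + n)$. The function names, the values of `n`, `a`, `b`, and the midpoint-rule integration are all hypothetical choices.

```python
import math


def binom_pmf(x, n, zeta):
    """h(x | zeta) of equation (1)."""
    return math.comb(n, x) * zeta**x * (1.0 - zeta) ** (n - x)


def beta_pdf(zeta, a, b):
    """Density of the assumed Beta(a, b) true-score distribution."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * zeta ** (a - 1) * (1.0 - zeta) ** (b - 1)


def phi_and_regression(x, n, a, b, steps=5000):
    """Midpoint-rule approximations to equations (2) and (3)."""
    width = 1.0 / steps
    phi = 0.0
    numer = 0.0
    for i in range(steps):
        zeta = (i + 0.5) * width
        weight = binom_pmf(x, n, zeta) * beta_pdf(zeta, a, b) * width
        phi += weight           # integrand of (2)
        numer += zeta * weight  # integrand of (3) before normalizing
    return phi, numer / phi


# Hypothetical example values, not taken from the paper.
n, a, b = 10, 2.0, 3.0
results = [phi_and_regression(x, n, a, b) for x in range(n + 1)]
phis = [p for p, _ in results]
regs = [r for _, r in results]
```

Under these assumptions the values in `phis` sum to one and `regs[x]` agrees with the straight line $(a + x)/(a + b + n)$ up to integration error.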



If a good estimate $\hat{G}(\zeta)$ of $G(\zeta)$ can be found, then the
corresponding estimate $\hat{\mu}_{\zeta|x}$ can be used as the empirical
Bayes estimate of $\zeta$ for any particular $x$. A number of techniques are
available for constructing reasonable estimates $\hat{\mu}_{\zeta|x}$ from
the observed-score distribution (for example, Robbins, 1956; Maritz, 1966;
Copas, 1969; Griffin & Krutchkoff, 1971); but they are of unknown accuracy
for any given $N$ and $n$. The technique presented here constructs an interval
with lower bound $\underline{\mu}_{\zeta|x}$ and upper bound
$\overline{\mu}_{\zeta|x}$ within which $\mu_{\zeta|x}$ must lie in order to be
"reasonably consistent" with the sample of observed scores.

Let the sample relative observed frequency distribution be $f(x)$,
$x = 0, 1, \ldots, n$. Consider $\chi^2_{1-\alpha}$ to be the $100(1 - \alpha)$
percentile of the chi-square distribution with $n$ degrees of freedom. A
$G(\zeta)$ will be considered reasonably consistent with the data if the
chi-square between the corresponding $\phi_G(x)$ defined by (2) and the given
$f(x)$ is less than or equal to $\chi^2_{1-\alpha}$:

$$\chi^2_G \equiv N \sum_{x=0}^{n} \frac{[f(x) - \phi_G(x)]^2}{\phi_G(x)} \le \chi^2_{1-\alpha}. \tag{4}$$

Let $\Gamma_\alpha$ be the set of all cumulative distribution functions
$G(\zeta)$ that satisfy (4). The problem to be solved may then be stated as
follows: For each $x = 0, 1, \ldots, n$, find $\underline{\mu}_{\zeta|x}$, the
smallest $\mu_{\zeta|x}$, and $\overline{\mu}_{\zeta|x}$, the largest
$\mu_{\zeta|x}$ obtainable from (3) under the restriction that $G(\zeta)$ be
in $\Gamma_\alpha$.
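The consistency criterion (4) is straightforward to evaluate for a candidate $G$. The following is a minimal sketch, assuming a candidate that places probability on a few points; the frequencies `f`, the candidate points and weights, and the sample size are hypothetical. The critical value 9.488 is the tabled 95th percentile of chi-square with 4 degrees of freedom, matching the hypothetical `n = 4`.

```python
import math


def binom_pmf(x, n, zeta):
    """h(x | zeta) of equation (1)."""
    return math.comb(n, x) * zeta**x * (1.0 - zeta) ** (n - x)


def phi_step(n, points, probs):
    """phi_G(x) of equation (2) when G puts mass probs[v] at points[v]."""
    return [sum(g * binom_pmf(x, n, z) for z, g in zip(points, probs))
            for x in range(n + 1)]


def chi_square_G(f, phi, N):
    """Left-hand side of (4): N * sum over x of (f - phi)^2 / phi."""
    total = 0.0
    for fx, px in zip(f, phi):
        if px > 0.0:
            total += (fx - px) ** 2 / px
        elif fx > 0.0:
            return math.inf  # scores were observed where G allows none
    return N * total


def is_consistent(f, phi, N, critical_value):
    """True when the candidate G belongs to the set defined by (4)."""
    return chi_square_G(f, phi, N) <= critical_value


# Hypothetical data: a 4-item test taken by 200 examinees.
n, N = 4, 200
f = [0.10, 0.20, 0.30, 0.25, 0.15]
candidate = phi_step(n, [0.3, 0.7], [0.5, 0.5])
stat = chi_square_G(f, candidate, N)
CRITICAL = 9.488  # 95th percentile of chi-square with 4 degrees of freedom
consistent = is_consistent(f, candidate, N, CRITICAL)
```

A candidate is retained only if `consistent` is true; the interval estimate is then the range of regression values over all retained candidates.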



By its construction, the interval
$(\underline{\mu}_{\zeta|x}, \overline{\mu}_{\zeta|x})$ can be considered a
confidence interval. With probability at least $1 - \alpha$, it will contain
the true value of the regression in the population from which the sample
was drawn. This procedure for constructing a confidence interval is not
entirely satisfactory, since only a lower bound for the confidence level is
known. Until better procedures are developed, however, the interval provides
more information about the accuracy of inference about true scores than
would otherwise be available.

Constructing the Confidence Interval

Substituting (1) into (2) and expanding gives

$$\phi_G(x) = \binom{n}{x} \sum_{r=0}^{n-x} \binom{n-x}{r} (-1)^r \mu_{x+r}, \qquad x = 0, 1, \ldots, n, \tag{5}$$

where $\mu_k$ is the $k$-th moment of $G(\zeta)$ about the origin.
Substituting (1) and (5) in (3) and again expanding gives

$$\mu_{\zeta|x} = \frac{\sum_{r=0}^{n-x} \binom{n-x}{r} (-1)^r \mu_{x+r+1}}{\sum_{r=0}^{n-x} \binom{n-x}{r} (-1)^r \mu_{x+r}}, \qquad x = 0, 1, \ldots, n. \tag{6}$$
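The moment expansions (5) and (6) can be verified against the direct definitions. The sketch below assumes a small hypothetical step-function $G$ (three mass points chosen arbitrarily) and compares the alternating moment sums with the mixture computed directly from (1); none of the names or numbers come from the paper.

```python
import math


def binom_pmf(x, n, zeta):
    """h(x | zeta) of equation (1)."""
    return math.comb(n, x) * zeta**x * (1.0 - zeta) ** (n - x)


def moment(k, points, probs):
    """k-th moment about the origin of a step-function G."""
    return sum(g * z**k for z, g in zip(points, probs))


def alternating_sum(x, n, shift, points, probs):
    """Sum over r of C(n-x, r) (-1)^r mu_{x+r+shift}."""
    return sum(math.comb(n - x, r) * (-1) ** r
               * moment(x + r + shift, points, probs)
               for r in range(n - x + 1))


def phi_from_moments(x, n, points, probs):
    """Equation (5)."""
    return math.comb(n, x) * alternating_sum(x, n, 0, points, probs)


def regression_from_moments(x, n, points, probs):
    """Equation (6): the binomial coefficient C(n, x) cancels."""
    return (alternating_sum(x, n, 1, points, probs)
            / alternating_sum(x, n, 0, points, probs))


def phi_direct(x, n, points, probs):
    """Equation (2) for a step function."""
    return sum(g * binom_pmf(x, n, z) for z, g in zip(points, probs))


def regression_direct(x, n, points, probs):
    """Equation (3) for a step function."""
    numer = sum(g * z * binom_pmf(x, n, z) for z, g in zip(points, probs))
    return numer / phi_direct(x, n, points, probs)


# Hypothetical step function: three mass points on a 6-item test.
n = 6
points = [0.2, 0.5, 0.9]
probs = [0.3, 0.5, 0.2]
gaps = [abs(regression_from_moments(x, n, points, probs)
            - regression_direct(x, n, points, probs))
        for x in range(n + 1)]
```

The alternating sums lose precision through cancellation as $n$ grows, which is one reason a direct evaluation is preferable for numerical work on long tests.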



Using a theorem by Markov (see Posse, 1886, sections V8 and V9; or
Karlin & Shapley, 1955) and equation (6) it can be shown (Lord, 1971)
that $\underline{\mu}_{\zeta|x}$ or $\overline{\mu}_{\zeta|x}$ is attained for
a given $x$ only when $G(\zeta)$ is a step function. A step function is a
cumulative distribution function which arises when discrete probabilities
$g_v$, $v = 1, \ldots, V$, are concentrated at points $\zeta_v$,
$v = 1, \ldots, V$. The theorem also proves that if $n$, the number of test
items, is even, $V$, the number of different points, will be at most
$\frac{n}{2} + 1$. The situation is similar when $n$ is odd, but will not be
detailed here. In addition, the theorem by Markov shows that if $(n - x)$
is even, $\underline{\mu}_{\zeta|x}$ is attained only when the smallest
$\zeta_v$ is 0.0, and $\overline{\mu}_{\zeta|x}$ is attained only when the
largest $\zeta_v$ is 1.0. Similarly, if $(n - x)$ is odd,
$\underline{\mu}_{\zeta|x}$ is attained only when the largest $\zeta_v$ is
1.0, and $\overline{\mu}_{\zeta|x}$ is attained only when the smallest
$\zeta_v$ is 0.0.

Thanks to Markov, the problem has now taken on a simpler form. To find
$\underline{\mu}_{\zeta|x}$ or $\overline{\mu}_{\zeta|x}$, only $\frac{n}{2}$
unknown true scores $\zeta_v$ need be found. Similarly, since the sum of all
probabilities, $g_v$, must be 1, only $\frac{n}{2}$ unknown probabilities
need be found. The problem simplifies further since it can be shown (Lord,
1971) that the solution lies on the boundary defined by
$\chi^2_G = \chi^2_{1-\alpha}$; therefore the inequality of equation (4) can
be replaced by strict equality.



When $G(\zeta)$ is a step function, (3) can be written as

$$\mu_{\zeta|x} = \frac{\sum_{v=1}^{V} g_v\, \zeta_v\, h(x \mid \zeta_v)}{\sum_{v=1}^{V} g_v\, h(x \mid \zeta_v)}, \tag{7}$$

where $V = \frac{n}{2} + 1$. The problem is to maximize or minimize
$\mu_{\zeta|x}$, given by equation (7), subject to the restrictions imposed
by (4), by $\sum_{v=1}^{V} g_v = 1.0$, and by the inequalities
$0 \le g_v \le 1.0$, $0 \le \zeta_v \le 1.0$. This problem can be solved
numerically for any given observed score distribution by
mathematical programming algorithms implemented on a computer.
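For a very short test the constrained problem can be explored by brute force. The sketch below is not the authors' program: it replaces the mathematical programming algorithm with a coarse grid search, and it assumes a hypothetical 2-item test, for which $V = n/2 + 1 = 2$ mass points suffice. The "observed" frequencies are generated from a known two-point $G$ so that the result can be checked; all names and numbers are illustrative, and the grid only approximates the true bounds.

```python
import math
from itertools import product


def binom_pmf(x, n, zeta):
    """h(x | zeta) of equation (1)."""
    return math.comb(n, x) * zeta**x * (1.0 - zeta) ** (n - x)


def phi(x, n, points, probs):
    """Denominator of equation (7), i.e. phi_G(x) for a step function."""
    return sum(g * binom_pmf(x, n, z) for z, g in zip(points, probs))


def regression(x, n, points, probs):
    """Equation (7); returns None when phi_G(x) is zero."""
    denom = phi(x, n, points, probs)
    if denom <= 0.0:
        return None
    numer = sum(g * z * binom_pmf(x, n, z) for z, g in zip(points, probs))
    return numer / denom


def chi_square(f, n, N, points, probs):
    """Left-hand side of the restriction (4)."""
    total = 0.0
    for x in range(n + 1):
        p = phi(x, n, points, probs)
        if p > 0.0:
            total += (f[x] - p) ** 2 / p
        elif f[x] > 0.0:
            return math.inf
    return N * total


def interval_by_grid(f, n, N, x, critical_value, grid_size=20):
    """Smallest and largest value of (7) over grid candidates meeting (4)."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    lo, hi = math.inf, -math.inf
    for z1, z2, g in product(grid, grid, grid):
        points, probs = [z1, z2], [1.0 - g, g]
        if chi_square(f, n, N, points, probs) > critical_value:
            continue  # this candidate G is outside the consistent set
        mu = regression(x, n, points, probs)
        if mu is not None:
            lo, hi = min(lo, mu), max(hi, mu)
    return lo, hi


# Hypothetical data generated from a known two-point G.
n, N = 2, 100
true_points, true_probs = [0.3, 0.8], [0.5, 0.5]
f = [phi(x, n, true_points, true_probs) for x in range(n + 1)]
CRITICAL = 5.991  # 95th percentile of chi-square with 2 degrees of freedom
lo, hi = interval_by_grid(f, n, N, 1, CRITICAL)
mu_true = regression(1, n, true_points, true_probs)
```

Because the generating $G$ is itself a grid candidate with a chi-square of zero, the interval `(lo, hi)` necessarily contains `mu_true`. A grid search of this kind becomes impractical as $n$ grows, which is why a proper optimization algorithm is needed for real tests.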

The algorithm used to find the numerical solution to the problem was the
sequential unconstrained minimization technique (SUMT) developed by Fiacco and
McCormick (1968, Chapter 4) and implemented by M. Hamilton. This algorithm
carries out a constrained minimization of a function (here, equation (7)) by
performing a series of unconstrained minimizations. The unconstrained minima
converge to the constrained minimum. Each unconstrained minimization minimizes
the sum of the function and some penalty function. The penalty function is
constructed to be large when a constraint is violated and small when it is not
violated. The penalty function used here restricts $G(\zeta)$ to
$\Gamma_\alpha$. The other restrictions were handled by simpler means. The
required minimization of the unconstrained function was accomplished by the
Fletcher-Powell-Davidon algorithm (Fletcher & Powell, 1963), programmed by
Jöreskog (1967, section 8) and modified by Hamilton. All computations were
performed on an IBM 360/65 in double precision.
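The idea of replacing one constrained minimization by a sequence of penalized unconstrained ones can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the SUMT program described above: it minimizes the hypothetical function $(t - 3)^2$ subject to $t \le 1$, uses a simple quadratic penalty on the violation, and substitutes a golden-section search for the Fletcher-Powell-Davidon step. In the paper the function is (7) and the penalty encodes (4).

```python
import math


def golden_section_min(func, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return (a + b) / 2.0


def objective(t):
    """Toy function to be minimized (hypothetical)."""
    return (t - 3.0) ** 2


def violation(t):
    """Amount by which the toy constraint t <= 1 is violated."""
    return max(0.0, t - 1.0)


def penalized_minima(weights):
    """One unconstrained minimization per penalty weight."""
    minima = []
    for r in weights:
        def penalized(t, r=r):
            # Large when the constraint is violated, zero otherwise.
            return objective(t) + r * violation(t) ** 2
        minima.append(golden_section_min(penalized, -10.0, 10.0))
    return minima


weights = [1.0, 10.0, 100.0, 1000.0, 10000.0]
minima = penalized_minima(weights)
```

For this toy problem each unconstrained minimum is $(3 + r)/(1 + r)$, so the sequence in `minima` decreases towards the constrained minimum at $t = 1$ as the penalty weight $r$ grows.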

Results 

This procedure has been applied to a variety of mental test data. The tests
presented here were selected for their unusual features. The values of
$\alpha$ were chosen for convenience of computation.

Table 1. Observed cumulative frequency distribution and corresponding
         interval estimates ($\alpha = 0.086$) for the regression of true
         score on observed score.

   x    Cumulative      $\hat{\mu}_{\zeta|x}$    Interval Estimate
        Distribution                             of the Regression

  30       1.000               .970                 .606-1.000
  24        .999               .713                 .595-.792
  18        .945               .544                 .498-.596
  12        .741               .371                 .342-.395
   6        .249               .237                 .216-.255
   0        .001               .137                 .009-.220

Data set 1. One such test consisted of 30 five-choice items administered
to 2385 examinees. Table 1, column 2, shows the cumulative observed frequency
distribution after random responses have been supplied for omitted items. This
test is of particular interest since one-fourth of the examinees had scores at
the chance level (x = 6) or below, with one-sixth of the scores below chance.
The presence of so many people at or below the chance level raises a number
of questions about the distribution of true scores. Are most or many of the
true scores also at or below the chance level? Do some people score
systematically lower than if they responded at random? What proportion of
examinees can safely be assumed to have true scores above chance level?

The last column shows, for selected values of x, the interval estimates
of the regression obtained by the method outlined in this paper for
$\alpha = 0.086$. Since the regression function is to be used as giving the
estimated true score for a given observed score, one can see the range of
estimates that could reasonably be so used. The intervals demonstrate clearly
that real differences exist on the dimension tested in spite of all the
guessing. One cannot rule out the presence of true scores below the chance
level, or of very high true scores.

For observed scores of 12 and 6, the intervals are tolerably short. It
is interesting to note that for x < .2n, the interval estimate lies above
x/n; for x > .4n, the interval estimate lies below x/n. This would seem
to be a rather extreme manifestation of regression towards the mean.

It is easily shown that a straight-line regression can fit inside all of
the intervals. However, this is not a sensitive test for linearity of
regression. Under the binomial error model considered here, linearity
necessarily leads to a negative hypergeometric distribution of observed scores
(Lord & Novick, 1968, section 23.6). To test for linearity, a negative
hypergeometric distribution was fitted to the observed score distribution.
The $\chi^2$ obtained for this fit was far beyond the tabled 99.9 percentile.
Thus, the hypothesis of a linear regression of true score on observed score
cannot be maintained for these data.
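A linearity check of this kind can be sketched as follows. The paper does not say how the negative hypergeometric distribution was fitted, so the method of moments used here is an assumption, as are the function names and the example values. The distribution is written in its beta-binomial form with parameters `a` and `b`; the fit requires the observed variance to exceed the binomial variance $n p (1 - p)$.

```python
import math


def neg_hypergeometric_pmf(x, n, a, b):
    """Beta-binomial form of the negative hypergeometric distribution."""
    log_p = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + math.lgamma(x + a) + math.lgamma(n - x + b)
             - math.lgamma(n + a + b)
             + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_p)


def fit_by_moments(f, n):
    """Match the mean and variance of the relative frequencies f(x)."""
    mean = sum(x * fx for x, fx in enumerate(f))
    var = sum((x - mean) ** 2 * fx for x, fx in enumerate(f))
    p = mean / n
    # var = n p (1 - p) [1 + (n - 1) rho], with rho = 1 / (a + b + 1).
    rho = (var / (n * p * (1.0 - p)) - 1.0) / (n - 1)
    total = 1.0 / rho - 1.0  # a + b
    return p * total, (1.0 - p) * total


def chi_square_fit(f, fitted, N):
    """Chi-square between observed and fitted relative frequencies."""
    return N * sum((fx - px) ** 2 / px for fx, px in zip(f, fitted))


# Hypothetical check: frequencies that are exactly negative hypergeometric.
n, N = 10, 500
f = [neg_hypergeometric_pmf(x, n, 2.0, 3.0) for x in range(n + 1)]
a_hat, b_hat = fit_by_moments(f, n)
fitted = [neg_hypergeometric_pmf(x, n, a_hat, b_hat) for x in range(n + 1)]
stat = chi_square_fit(f, fitted, N)
```

With real data the statistic `stat` would be referred to a chi-square table; a value far in the upper tail, as reported for Data set 1, argues against a linear regression.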

The third column of Table 1 gives the (nonlinear) regression, obtained
some years ago by a very different approach (Lord, 1969), for a
$\hat{G}(\zeta)$ that produced a good fit to the observed-score distribution
(the $\chi^2$ between $\phi_{\hat{G}}(x)$ and $f(x)$ was at the 60th
percentile, with 19 degrees of freedom). It is reassuring to find that this
regression lies well within the interval estimates shown in the last column.

Data set 2. The technique was applied to another set of data consisting
of the responses to 38 five-choice engineering items administered to 717
examinees. The mean number-right score on this subtest was 12. The subtest has
spectacularly low reliability: the Kuder-Richardson coefficient $KR_{20}$ is
only 0.35. (The reason for such low reliability may be that the questions
covered different engineering specialities, such as mechanical, electrical,
or chemical engineering, but most examinees were familiar with only one
speciality.)

Interval estimates of the regression of true score on observed score
were computed for five observed scores, with $\alpha = 0.01$. The results
are shown below:

Observed score x:                        2          7         12         17         22
Cumulative distribution of x:          .001       .073       .591       .934       .997
Interval estimate of the regression: .022-.321  .246-.321  .289-.332  .315-.407  .330-.596

All of these intervals contain at least one value in the range 0.32 to 0.33,
which leaves open the remote possibility that examinees with observed scores
throughout the range $2 \le x \le 22$ may all have about the same true score.
This lack of discrimination is in agreement with the low test reliability.
Zero reliability would imply that all true scores were identical, the
variation of observed scores being entirely due to errors of measurement. A
direct test of the hypothesis of zero reliability is called for if this
hypothesis is of interest.
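The statement about the range 0.32 to 0.33 can be checked directly from the five intervals reported for Data set 2. The short sketch below (variable names are illustrative) tests whether each interval overlaps that band, and also computes the largest lower bound and the smallest upper bound to see whether a single value is common to all five.

```python
# Interval estimates reported for Data set 2, keyed by observed score x.
intervals = {
    2: (0.022, 0.321),
    7: (0.246, 0.321),
    12: (0.289, 0.332),
    17: (0.315, 0.407),
    22: (0.330, 0.596),
}
band = (0.32, 0.33)


def overlaps(interval, other):
    """True when two closed intervals share at least one point."""
    return interval[0] <= other[1] and other[0] <= interval[1]


touches_band = {x: overlaps(iv, band) for x, iv in intervals.items()}
largest_lower = max(lo for lo, _ in intervals.values())
smallest_upper = min(hi for _, hi in intervals.values())
```

Every interval overlaps the band, but `largest_lower` (0.330) exceeds `smallest_upper` (0.321), so the intervals share a narrow band of nearly equal values rather than one exactly common value; this matches the wording "about the same true score".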
Data set 3. The effect of large sample size on the width of the interval
estimate was investigated by using the scores of 137,052 examinees on a test
composed of 50 five-choice math items. Using $\alpha = 0.05$, the interval
estimate computed for the median (x = 25) of the distribution of number-right
scores was found to be 0.496-0.509, a satisfyingly short interval.
Calculations for other x values were not done (because of the expense, due to
the large n).

Data set 4. In order to check further the efficacy of the interval
estimates of regression, a set of hypothetical data was used. The observed
relative frequency distribution was constructed by selecting 1000 cases at
random from a negative hypergeometric distribution with n = 24. Table 2,
column 2, shows the cumulative frequency distribution obtained.

The fifth column displays the interval estimates of the regression for seven
values of x, with $\alpha = 0.0375$. Since the population distribution from
which the sample was drawn was negative hypergeometric, the data are
consistent under the binomial error model with the assumption that the
population regression is linear. The actual linear regression for the
population was computed and is shown in column 4 of the table. Clearly, the
interval estimate in column 5 recovers the information about the population
linear regression. In fact, the values of the population linear regression
differ from the midpoints of the intervals by a maximum of 0.019.
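The comparison with the interval midpoints can be reproduced from the N = 1000 intervals in Table 2. The sketch below assumes that the population regression is the straight line running from .100 at x = 0 to .900 at x = 24, the end points of column 4; the variable and function names are illustrative.

```python
n = 24

# Interval estimates for N = 1000 as read from Table 2, keyed by x.
intervals_1000 = {
    24: (0.765, 0.998),
    20: (0.705, 0.822),
    16: (0.575, 0.675),
    12: (0.467, 0.558),
    8: (0.310, 0.409),
    4: (0.185, 0.297),
    0: (0.009, 0.216),
}


def population_regression(x):
    """Assumed line through .100 at x = 0 and .900 at x = 24."""
    return 0.100 + 0.800 * x / n


deviations = {
    x: abs(population_regression(x) - (lo + hi) / 2.0)
    for x, (lo, hi) in intervals_1000.items()
}
max_deviation = max(deviations.values())
all_inside = all(lo <= population_regression(x) <= hi
                 for x, (lo, hi) in intervals_1000.items())
```

The largest deviation occurs at x = 24, where the line gives .900 and the midpoint is .8815, and every interval contains the corresponding point on the line.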

Data set 5. The third column of Table 2 displays the cumulative
frequency distribution of 50 cases that were selected at random from the 1000.
Column 6 shows the corresponding interval estimates of the regression. As
expected, the intervals are much wider than those for the original 1000 cases,
but not $\sqrt{1000/50} \approx 4.5$ times as wide. The width of the interval
is doubled or tripled.
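The widening can be quantified from the two sets of intervals in Table 2. The sketch below simply divides the width of each N = 50 interval by the width of the corresponding N = 1000 interval and compares the result with the square-root factor mentioned above; the names are illustrative.

```python
import math

# Interval estimates from Table 2, keyed by observed score x.
intervals_1000 = {
    24: (0.765, 0.998), 20: (0.705, 0.822), 16: (0.575, 0.675),
    12: (0.467, 0.558), 8: (0.310, 0.409), 4: (0.185, 0.297),
    0: (0.009, 0.216),
}
intervals_50 = {
    24: (0.643, 1.000), 20: (0.611, 0.868), 16: (0.532, 0.742),
    12: (0.404, 0.618), 8: (0.258, 0.528), 4: (0.093, 0.454),
    0: (0.000, 0.454),
}


def width(interval):
    """Length of an interval given as (lower, upper)."""
    return interval[1] - interval[0]


ratios = {x: width(intervals_50[x]) / width(intervals_1000[x])
          for x in intervals_1000}
sqrt_factor = math.sqrt(1000 / 50)
```

The ratios run from about 1.5 at x = 24 to about 3.2 at x = 4, all well below `sqrt_factor`.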
Table 2. Observed cumulative frequency distribution and interval estimates
         for hypothetical data, $\alpha = 0.0375$.

        Cumulative     Cumulative                              Interval       Interval
        Distribution   Distribution                            Estimate       Estimate
   x    of x, N=1000   of x, N=50     $\mu_{\zeta|x}$          for N=1000     for N=50

  24       1.000          1.00             .900                .765-.998      .643-1.000
  20        .954           .96             .767                .705-.822      .611-.868
  16        .795           .72             .633                .575-.675      .532-.742
  12        .523           .48             .500                .467-.558      .404-.618
   8        .265           .22             .367                .310-.409      .258-.528
   4        .072           .06             .233                .185-.297      .093-.454
   0        .002           .00             .100                .009-.216      .000-.454

Each interval estimate is given as
$\underline{\mu}_{\zeta|x}$-$\overline{\mu}_{\zeta|x}$.